Generalized Substring Compression
Authors
Abstract
In substring compression, one is given a text to preprocess so that, upon request, a compressed version of any substring can be returned. Generalized substring compression adds the following twist: each query also contains a context substring (or a collection of context substrings), and the answer is the requested substring in compressed format, where the context is used to make the compression more efficient. We focus on generalized substring compression and present the first non-trivial correct algorithm for this problem. As part of our algorithm, we propose a method for finding the bounded longest common prefix of substrings, which may be of independent interest. In addition, we propose an efficient algorithm for substring compression that makes use of range searching for minimum queries. We present several tradeoffs for both problems. For compressing the substring S[i..j] (possibly with the substring S[α..β] as a context), the best query times we achieve are O(C) for a substring compression query and O(C log((j − i)/C)) for a generalized substring compression query, where C is the number of phrases encoded.
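To make the two query types concrete, here is a minimal Python sketch of what a query returns. It is a naive quadratic-time baseline, not the algorithm of the paper: it greedily parses S[i..j] into LZ77-style phrases, and when a context S[α..β] is supplied it also allows phrases copied from the context, with each match cut off at the context boundary (a brute-force stand-in for the bounded longest common prefix). The function names, the phrase tuples, and the requirement that the context not overlap the queried substring are assumptions made for illustration.

```python
def bounded_lcp(s, p, q, limit):
    """Longest common prefix of s[p:] and s[q:], capped at `limit` characters."""
    n = 0
    while n < limit and s[p + n] == s[q + n]:
        n += 1
    return n


def compress(s, i, j, context=None):
    """Greedy LZ77-style parse of s[i..j] (0-based, inclusive).

    context: optional (a, b) inclusive range, assumed disjoint from [i, j].
    Returns phrases ("lit", char) or ("copy", source_index_in_s, length).
    """
    if context is not None:
        a, b = context
        if not (b < i or a > j):
            raise ValueError("this sketch assumes the context does not overlap s[i..j]")
    phrases = []
    k = i
    while k <= j:
        remaining = j - k + 1
        best_len, best_src = 0, -1
        # Sources inside the already-parsed part of the substring (overlap allowed).
        for p in range(i, k):
            n = bounded_lcp(s, p, k, remaining)
            if n > best_len:
                best_len, best_src = n, p
        # Sources inside the context; a match may not run past the context end.
        if context is not None:
            for p in range(a, b + 1):
                n = bounded_lcp(s, p, k, min(remaining, b - p + 1))
                if n > best_len:
                    best_len, best_src = n, p
        if best_len == 0:
            phrases.append(("lit", s[k]))
            k += 1
        else:
            phrases.append(("copy", best_src, best_len))
            k += best_len
    return phrases


def decompress(s, phrases, i, context=None):
    """Rebuild the substring from phrases; only the context part of s is read."""
    out = []
    for ph in phrases:
        if ph[0] == "lit":
            out.append(ph[1])
            continue
        _, src, n = ph
        if context is not None and context[0] <= src <= context[1]:
            out.extend(s[src:src + n])
        else:
            for t in range(n):  # character by character, so overlaps work
                out.append(out[src - i + t])
    return "".join(out)


text = "abcabcabc"
plain = compress(text, 3, 8)               # 4 phrases: a, b, c, copy
with_ctx = compress(text, 3, 8, (0, 2))    # 2 phrases: copy from context, copy
```

In this example the context reduces the number of phrases C from 4 to 2, which is the effect the generalized query is meant to exploit. The sketch spends time proportional to the substring and context lengths on every phrase, whereas the query times stated in the abstract depend on C.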
Similar resources
Efficient VLSI Architecture for Lossless Data Compression
An architecture for LZ1-type lossless data compression is described. The architecture is area-efficient and fast since it exploits the locality of substring match lengths. The property has been shown experimentally for various data and buffer lengths, and an architecture based on it has been designed.
Finding Synchronization Codes to Boost Compression by Substring Enumeration
Synchronization codes are frequently used in numerical data transmission and storage. Compression by Substring Enumeration (CSE) is a new lossless compression scheme that has turned into a new and unusual application for synchronization codes. CSE is an inherently bit-oriented technique. However, since the usual benchmark files are all byte-oriented, CSE incurred a penalty due to a problem calle...
Finding Characteristic Substrings from Compressed Texts
Text mining from large-scale data is of great importance in computer science. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeating substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel p...
String Noninclusion Optimization Problems
For every string inclusion relation there are two optimization problems: find a longest string included in every string of a given finite language, and find a shortest string including every string of a given finite language. As an example, the two well-known pairs of problems, the longest common substring (or subsequence) problem and the shortest common superstring (or supersequence) problem, ...
Implementation of Delta Compression
A Matlab simulation is carried out to verify the compression ratio analysis. Packet Xk and V are generated as two i.i.d. random sequences that follow a discrete uniform distribution between 0 and 255 with a packet length of 1,500 bytes. Packet Xk+1 is generated according to the simplified content generation model. Considering Xk and Xk+1 as two byte strings, our lossless delta compression algor...
Journal title:
Volume, issue:
Pages: -
Publication date: 2009